Interactive Assignment: World Life Expectancy

Data Management

This dataset is based on a multitude of datasets to create “World Life Expectancy”. This is based on merging and selecting variables from four specific datasets “Income per Person”, “Life Expectancy in Years”, “Population size”, and “Country Regions”. These are all possible factors that can have an impact on one’s life expectancy in different areas of the world. The data was collected for the years of 1800-2018.

The final data set was used from the previous week’s assignment to fullfill the following conditions:

  1. Reshape data set: Income Per Person to make a longitudinal data such that the resulting data set has three columns: country, year, and income.
  2. Do the same for Life Expectancy in Years so that the resulting data set has three columns: country, year, and life expectancy.
  3. Merge/join the above two longitudinal data sets to make a new data set, under name LifeExpIncom that has variables: country, year, lifeExp, and income.
  4. Merge LifeExpIncom with country region so that the final data set has information about income, life expectancy, and country region.
  5. Merge the previous resulting data set with population size so that the final data set has information about income, life expectancy, population size, and country region.
##a
long_income <- income %>%
               pivot_longer(col= starts_with("X"),
                            names_to = "year",
                            values_to = "income")%>%
                rename_at('geo', ~'country')

##b
long_life <- life_ex %>%
               pivot_longer(col= starts_with("X"),
                            names_to = "year",
                            values_to = "life expectancy")%>%
                rename_at('geo', ~'country')
               

##c
LifeExpIncom <- merge(long_income, long_life, by=c("country","year"))

##d
d <- country_tot %>%
  rename_at('name', ~'country')

combo <-full_join(LifeExpIncom, d, by="country")
combo <- combo %>%
  dplyr::select("income", "life expectancy","country", "region", "year")
##e
long_pop <- pop_tot %>%
               pivot_longer(col= starts_with("X"),
                            names_to = "year",
                            values_to = "population size")%>%
                rename_at('geo', ~'country')

cleaned <-merge(combo, long_pop, by=c("country","year"))
cleaned_final <- cleaned %>%
  dplyr::select("income", "life expectancy","population size", "region")

library(data.table)

library(readr)
# Write to CSV file
write_csv(cleaned_final, "C:/Users/jessg/OneDrive - West Chester University of PA/sta553/wk5/datasets/cleaned_final.csv")

The final data set has a total of 40953 observations of 4 different variables. These variables are: income, life expectancy, population size,and region. The first three variables are numeric, and region is categorical.

Interactive Plot

Now that all four data sets have been merged/joined to create a singular dataset, “cleaned_data”, it is time to make an interactive scatterplot.

Before we can use the data, we must subset one more time to only account for the year 2015. The following code chunk will show this step.

data2015 <- cleaned %>%
  filter(year=='X2015')%>%
  dplyr::select("income", "life expectancy","population size", "region")

# Write to CSV file
write_csv(data2015, "C:/Users/jessg/OneDrive - West Chester University of PA/sta553/wk6/datasets/data2015.csv")

The final sub-setted data set named, data2015, has a total of 187 observation with four variables.

The code chunck below will create an interactive scatterplot vizulation using plotly with the following conditions: a. Set the point size to be proportional to the population size b. Use different colors for different countries. c. Choose an appropriate transparency level so that overlapped points can be viewed. d. Choose an appropriate color to highlight the point boundary so that partially overlapped points can be easily distinguished. e. Include the country name and population size in the hover text.

names(data2015)[names(data2015)=="life expectancy"] <- "life"
names(data2015)[names(data2015)=="population size"] <- "size"
fig <-plot_ly(
    data = data2015,
    x = ~life,         # Horizontal axis 
    y = ~income,          # Vertical axis 
    color = ~factor(region),  # must be a numeric factor
     hovertext = ~region,          # show the species in the hovertext
     ####
     marker = list(size =~size/10000000,  sizeref = .05, sizemode = 'area'),
    customdata = ~region,
     alpha  = 0.9,
     type = "scatter",
     mode = "markers",
    
     #
     ## using the following hovertemplate() to add the information of the
     ## Two numerical variables to the hover text.
     hovertemplate = paste('<i><b>Income<b></i>: %{y}',
                           '<br><b>Life Expectancy</b>:  %{x}',
                           '<br><b>Region</b>: %{customdata}'))%>%
                          layout(title = 'Life Expectancy vs Income in 2015', plot_bgcolor = "white")

fig

After looking at the scatter plot, there is a clear indication that life expectancy and income are exponentially correlated. When looking at the different regions, Africa by far is not living as long as the other regions. This also makes sense as Africa makes the least amount of money. This show how much of an impact income has on life expectancy. Asia has the highest porportion of people living longer, as well as higher incomes. Europe also has the second highest amount of people living longer, and has higher incomes. This can be caused by outside factors such as healthcare, pollution, and war.

names(cleaned)[names(cleaned)=="life expectancy"] <- "life"
names(cleaned)[names(cleaned)=="population size"] <- "size"

pal.IBM <- c("#332288", "#117733", "#0072B2","#D55E00", "#882255")
pal.IBM <- setNames(pal.IBM, c("Asia", "Europe", "Africa", "Americas", "Oceania"))


fig2 <-
  plot_ly(
    data = cleaned,
    x = ~life, 
    y = ~income, 
    #size = ~(2*log(size)-11)^2,
    color = ~region, 
    colors = pal.IBM,   # custom colors
    marker = list(size =~size/10000000,  sizemode = 'area'),
    frame = ~year,      # the time variable to
    # to display in the hover
    text = ~paste("Region:", region,
                  "<br>Year:", year,
                  "<br>LifeExp:", life,
                  "<br>Pop:", size,
                  "<br>Income:", income),
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )


#fig2